Homework 1 (Decision Trees and kNN)¶

ECS 271¶

Kay Royo¶

Part 1.¶

(not graded): The following are true/false questions. You don’t need to answer the questions. Just tell us which ones you can’t answer confidently in less than one minute. (You won’t be graded on this.) If you can’t answer at least 8, you should probably spend some extra time outside of class beefing up on elementary math. I would strongly suggest going through this math tutorial by Hal Daume: https://web.cs.ucdavis.edu/%7Ehpirsiav/courses/MLf22/math4ml.pdf

(a) $log(x) + log(y) = log(xy) $

Answer: True

(b) $log[ab^c] = log \ a + (log \ b)(log \ c)$

Answer: False; $log[ab^c] = log \ a + c(log \ b)$

(c) $ \frac{\partial}{\partial x} \sigma(x) = \sigma(x) \ \times \ (1-\sigma(x))$ where $\sigma(x) = \frac{1}{1+e^{-x}}$

Answer: False; $ \frac{\partial}{\partial x} \sigma(x) = \frac{\partial}{\partial x} \frac{1}{1+e^{-x}} = \frac{e^{x}}{(e^{x} +1)^{2}}$

(d) The distance between the point ($x_1, y_1$) and line $ax+by+c$ is $\frac{(ax_1+by_1+c)}{\sqrt{a^{2}+ b^{2}}}$

Answer: True

(e) $\frac{\partial}{\partial x} log \ x = - \frac{1}{x}$

Answer: False; $\frac{\partial}{\partial x} log \ x = \frac{1}{x}$

(f) $p(a|b) = \frac{p(a,b)}{p(b)}$

Answer: True; $p(a,b)$ is intersection of a and b

(g) $p(x | y, z) = p(x | y)p(x | z)$

Answer: False; $p(x | y, z) = \frac{p(x,y,z)}{p(y,z)} = \frac{ p(y|x,z)p(x|z)}{p(y|z)}$

(h) $C(n, k) = C(n−1, k −1) +C(n−1, k)$, where $C(n, k)$ is the number of ways of choosing $k$ objects from $n$

Answer: True

(i) $||\alpha u+v||^2 = \alpha ^2 ||u||^2 + ||v|^2$, where ||.|| denotes Euclidean norm, $\alpha$ is a scalar and $u$ and $v$ are vectors

Answer: True; If $u$ and $v$ are orthogonal aka $u.v = 0$

(j) $ |u^T v| \ge ||u|| \times ||v||$, where $|·|$ denotes absolute value and $u^⊤v$ is the dot product of $u$ and $v$

Answer: False; $ |u^T v| \le ||u|| \times ||v||$ (Cauchy-Schwarz Inequality)

(k) $ \int_{- \infty}^{\infty} dx \ exp \big[- \frac{\pi}{2} x^2 \big]= \sqrt{2}$

Answer: True

Part 2.¶

(not graded): Go though this Matlab tutorial by Stefan Roth: http://cs.brown.edu/courses/csci1950-g/docs/matlab/matlabtutorialcode.html

Part 3.¶

In class, we looked at an example where all the attributes were binary (i.e., yes/no valued). Consider an example where instead of the attribute “Morning?”, we had an attribute “Time” which specifies when the class begins.

(a) We can pick a threshold $\tau$ and use (Time < $\tau$) as a criteria to split the data in two. Explain how you might pick the optimal value of $\tau$.

Answer:

If given a dataset has a combination of binary and nonbinary attributes and a categorical target variable, then we need to pick the most useful attribute that gives the "best split" and maximum difference, we can consider Entropy, which can tell us the degree of randomness or uncertainties in our data that can be calculated using the formula below where $p_i$ is the probability of class i. $$E(S) = \sum_{i=1}^{c} -p_i log_2 p_i$$ We can then use a splitting criterion such as Information Gain, which measures the reduction of uncertainties in our data, which can be computed using the formula below, where the variable y denotes the target variable and the variable x represents some additional information or feature, such as Time, about the target variable. $$IG(x,y) = E(y) - E(y|x)$$ We want the addtional feature $x$ to help reduce the uncertainties around the target variable. Thus, the feature that maximimizes the information gain and reduces impurities can be used to make the decision. Once the useful attribute is determined, we can create a decision node that splits the data on the best attribute and then find the optimal value of $\tau$ to use for splitting the data into two.

To find the optimal value $\tau$ using for example, the attribute Time (which we can assume as the best attribute to use as a result the process mentioned above) to classify the data, we can then do the following,

  1. Sort the feature Time with values $T$={${t_1,\cdots , t_n }$}.
  2. For $i=1,\cdots, n-1$, we can consider a splitting criteria $\tau_i = \frac{t_i + t_(i+1)}{2}$, which is the midpoint of adjacent times (assuming that there is an existing method to compute the mid point between times).
  3. Compute the classification error for threshold split $\tau_i$
  4. Choose $\tau_i$ with the lowest classification error.

Here $\tau_i$ can be some midpoint time that results in purest nodes. This brute force approach of trying out every single time midpoint would be computationally expensive. Thus, we can also convert the variable Time into a categorical variable and perform the following.

  1. Convert Time to a categorical variable and save this as a new attribute Time_class. For example, categorize Time as either [24:00:00 am - 11:59:00 am] or [12:00:00 pm - 11:59:00 pm] assuming it is in military time. This would result in a two-way split.
  2. Sort the raw Time attribute and take the midpoint of adjacent times.
  3. Select time midpoint candidates for $\tau$. Among these candidates, choose the one the produces the purest nodes as the optimal value of $\tau$.

(b) In the decision tree learning algorithm discussed in class, once a binary attribute is used, the subtrees do not need to consider it. Explain why when there are continuous attributes this may not be the case.

Answer:

When there is a binary attribute, we can complete the split at the first level of the tree so the subtrees do not need to consider it. However, when there are continuous attributes, the subtrees need to consider it because it might require multiple tree levels to complete the split where there is a lowest possible impurity. Binary attributes result in a two-way split, whereas continuous attributes may result in multiway-split, where a leaf node of a decision can further form its own subtree and be split into new leaf nodes. In other words, the decision or outcome variable at the bottom of the tree may depend on other decisions or attributes further up the tree, which is why a continuous variable may be be considered by the subtrees.

Part 4.¶

Why memorizing the training data and doing table lookups is a bad strategy for learning? How do we prevent that in decision trees?

Answer:

There are various reasons why memorization is a bad strategy for learning. One reason is that memorizing the training data leads to overfitting. If a model is overfitted, then would not be able to generalize to unseen data. For example, if we train a polynomial regression model $y = \beta_0 + \beta_1x_1 + \beta_2 x_2^2 + \epsilon$ using a small dataset, then the model can memorize these three coefficients, predict the labels exactly, and reach zero training loss. The model would have a very high accuracy on the training data but the accuracy significantly decreases for a test data. Thus, memorizing training data and doing lookup tables only allows for predictions to be made on test data that is the same as the training data.

In order to prevent memorization or overfitting in decision trees, the following methods can be used.

  1. Pre-pruning - produce a tree with fewer branches by tuning the hyperparameters being used, which include minimum sample split and depth. By tuning the hyperparameters, we can stop generating a full tree during the training process.
  2. Post-pruning - produce a full tree and then remove some of its parts to prevent overfitting without affecting the overall accuracy. One example of post-pruning parameter is CCP (Cost Complexity Pruning).
  3. Larger dataset - The larger the dataset is being used for training, the more varied the training samples are for the model to train on.
  4. Ensembling methods such as Random Forest - combines the different decision trees generated using different samples to decrease the variation in the predictions.

After using these methods, the model must be evaluated using a validation set to ensure that the implementation of these methods has a positive effect i.e. higher accuracy, less error, etc.

Part 5.¶

What does the decision boundary of 1-nearest neighbor classifier for 2 points (one positive, one negative) look like?

Answer:

The decision boundary of 1-nearest neighbor classifier for 2 points (one positive, one negative) that belong to differenct classes would be piecewise linear bisector that is located equidistant from the two points. Since there are only two points, the decision boundary is very simple.

In [1]:
import plotly.express as px
import plotly.graph_objects as go
fig = px.scatter(x=[-5, 5], y=[5, 5])
fig.add_trace(go.Scatter(x=[0,0], y=[10,0], mode='lines', name='trend line', line=dict(color='red')))
fig.show(renderer='notebook')

Part 6. kNN acuraccy using Euclidean & Manhattan distances¶

Does the accuracy of a kNN classifier using the Euclidean distance change if you (a) translate the data (b) scale the data (i.e., multiply the all the points by a constant), or (c) rotate the data? Explain. Answer the same for a kNN classifier using Manhattan distance

Answer:

i. kNN using Euclidean distance

(a) translate the data

Answer:

No, the accuracy of a kNN classifier using the Euclidean distance does not change if we translate the data. When we translate the data, we are simply shifting the position of the data points by a fixed amount in space. This translation does not change the relative distances between the data points. As a result, if we use the Euclidean distance as the distance metric for the kNN classifier, the distance between any two points will remain the same even after translation.

For example, if we have two data points A and B with coordinates (2, 3) and (4, 5), respectively. The distance between these two points can be calculated using the Euclidean distance formula as follows:

$$distance \ \ = \ \ \sqrt{\big((4 - 2)^2 + (5 - 3)^2\big)} = \sqrt{(8)}$$

If we translate both points by a fixed amount (dx, dy), their new coordinates become (2+dx, 3+dy) and (4+dx, 5+dy), respectively. The distance between the two points will still be the same:

$$distance \ \ = \ \ \sqrt{\big((4+dx - 2-dx)^2 + (5+dy - 3-dy)^2\big)} = \sqrt{(8)}$$

(b) scale the data (i.e., multiply the all the points by a constant)

Answer:

No, the accuracy of a kNN classifier using the Euclidean distance does not change if we scale the data because scaling the data changes the scale of the data points, but it does not change their relative distances. Therefore, if we use the Euclidean distance as the distance metric for the kNN classifier, the distance between any two points will still be the same relative to each other after scaling.

For example, consider two data points A and B with coordinates (2, 3) and (4, 5), respectively. The distance between these two points can be calculated using the Euclidean distance formula as follows:

$$distance \ = \ \sqrt{((4 - 2)^2 + (5 - 3)^2)} = \sqrt{(8)}$$

Now, suppose we scale the data points by a factor of 2, such that the new coordinates become (4, 6) and (8, 10), respectively. The distance between the two points will still be the same:

$$distance \ = \ \sqrt{((8 - 4)^2 + (10 - 6)^2)} = \sqrt{(32)} = \sqrt{(2^2 * 8)} = 2 * \sqrt{(8)}$$

Thus, the accuracy of a kNN classifier using the Euclidean distance does not change if we scale the data, as the relative distances between the data points remain the same. However, if we use other distance metrics that are sensitive to the scale of the data points, such as the Mahalanobis distance, then the accuracy of the kNN classifier may be affected by scaling.

(c) rotate the data

Answer:

No ,rotation won't change the accuracy of kNN using Euclidean distance. Rotating the data does not generally result in changes in its length, shape, and angle measures so we can expect the calculated distances to be the same. With this, we can infer that the accuracy of kNN using Euclidean distance may not change even if we rotate the data. However, a slight change in kNN accuracy may still be possible depending on the data itself, the degree of rotation, or the type of distance measure we're using. Based on the calculated distance below, it appears that Euclidean distance is not as sensitive to data rotation as Manhattan distance.

If we consider 3 points $x = (0, 1), y = (1, 0)$, and $z = (1, -2)$, then Euclidean distance, using $y$ as the pivot point, between the 3 points would be

$$||x-y||_2 = \sqrt{(0-1)^2 + (1-0)^2}= \sqrt{2}$$$$||y-z||_2 = \sqrt{(1-1)^2 +(0-(-2))^2} = \sqrt{4}=2$$$$||x-y||_2 < ||y-z||_2$$

which means that $x$ and $z$ are not the same distance from $y$.

  • If we consider a 45-degree rotation matrix $ A = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{-1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\\ \end{bmatrix} \quad $

Then, the new co-ordinates after the 45 degrees rotation would be given by, $ \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{-1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\\ \end{bmatrix} $ $ \begin{bmatrix} x \\ y\\ \end{bmatrix} $

Hence, x, y and z get transformed to $x^*$, $y^*$ and $z^*$ as follows,

$$x^* = Ax = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{-1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\\ \end{bmatrix} \begin{bmatrix} 0 \\ 1\\ \end{bmatrix} = \begin{bmatrix} \frac{-1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}}\\ \end{bmatrix}$$$$y^* = Ay = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{-1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\\ \end{bmatrix} \begin{bmatrix} 1 \\ 0\\ \end{bmatrix} = \begin{bmatrix} \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \\ \end{bmatrix}$$$$z^* = Az = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{-1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\\ \end{bmatrix} \begin{bmatrix} 1 \\ -2\\ \end{bmatrix} = \begin{bmatrix} \frac{3}{\sqrt{2}} \\ \frac{-1}{\sqrt{2}}\\ \end{bmatrix}$$

Thus, the new Euclidean distance after a 45-degree rotation becomes $$||x^*-y^*||_2 = \sqrt{\Bigg(\frac{-1}{\sqrt{2}} - \frac{1}{\sqrt{2}}\Bigg)^2+\Bigg(\frac{1}{\sqrt{2}} - \frac{1}{\sqrt{2}}\Bigg)^2} =\sqrt{2}$$

$$||y^*-z^*||_2 =\sqrt{\Bigg(\frac{1}{\sqrt{2}} - \frac{3}{\sqrt{2}}\Bigg)^2+\Bigg(\frac{1}{\sqrt{2}} - \frac{-1}{\sqrt{2}} \Bigg)^2} =\sqrt{4}=2$$

Hence, ordering is preserved with Euclidean distance after 45-degree rotation since we get $$||x^*-y^*||_2< ||y^*-z^ *||_2 $$

  • If we consider a 90- degree rotation matrix $ B = \begin{bmatrix} 0 & -1 \\ 1 & 0\\ \end{bmatrix} \quad $

The points x, y and z get transformed to $x^{**}$, $y^{**}$ and $z^{**}$ as follows,

$$x^{**} = Bx = \begin{bmatrix} 0 & -1\\ 1 & 0\\ \end{bmatrix} \begin{bmatrix} 0 \\ 1\\ \end{bmatrix} = \begin{bmatrix} -1 \\ 0\\ \end{bmatrix}$$$$y^{**} = By = \begin{bmatrix} 0 & -1\\ 1 & 0\\ \end{bmatrix} \begin{bmatrix} 1 \\ 0\\ \end{bmatrix} = \begin{bmatrix} 0 \\ 1\\ \end{bmatrix}$$$$z^{**} = Bz = \begin{bmatrix} 0 & -1\\ 1 & 0\\ \end{bmatrix} \begin{bmatrix} 1 \\ -2\\ \end{bmatrix} = \begin{bmatrix} 2 \\ 1\\ \end{bmatrix}$$

Thus, the new Euclidean distance after a 90-degree rotation becomes $$||x^{**}-y^{**}||_2 = \sqrt{(-1 - 0)^2+(0 - 1)^2} =\sqrt{2}$$

$$||y^{**}-z^{**}||_2 =\sqrt{(0 - 2)^2+(1 - 1)^2} =\sqrt{4}=2$$

Hence, ordering is still preserved with Euclidean distance after 90-degree rotation since we get $$||x^{**}-y^{**}||_2< ||y^{**}-z^{**}||_2 $$

  • If we consider a 180-degree rotation matrix $ C = \begin{bmatrix} -1 & 0 \\ 0 & -1\\ \end{bmatrix} \quad $ The points x, y and z get transformed to $x^{***}$, $y^{***}$ and $z^{***}$ as follows,
$$x^{***} = Cx = \begin{bmatrix} -1 & 0 \\ 0 & -1\\ \end{bmatrix} \begin{bmatrix} 0 \\ 1\\ \end{bmatrix} = \begin{bmatrix} 0 \\ -1\\ \end{bmatrix}$$$$y^{***} = Cy = \begin{bmatrix} -1 & 0 \\ 0 & -1\\ \end{bmatrix} \begin{bmatrix} 1 \\ 0\\ \end{bmatrix} = \begin{bmatrix} -1 \\ 0\\ \end{bmatrix}$$$$z^{***} = Cz = \begin{bmatrix} -1 & 0 \\ 0 & -1\\ \end{bmatrix} \begin{bmatrix} 1 \\ -2\\ \end{bmatrix} = \begin{bmatrix} -1 \\ 2\\ \end{bmatrix}$$

Thus, the new Euclidean distance after a 90-degree rotation becomes $$||x^{***}-y^{***}||_2 = \sqrt{(0 - (-1))^2+(-1 - 0)^2} =\sqrt{2}$$

$$||y^{***}-z^{***}||_2 =\sqrt{(-1 - (-1))^2+(0 - 2)^2} =\sqrt{4}=2$$

Hence, ordering is still preserved with Euclidean distance after 180-degree rotation since we get $$||x^{***}-y^{***}||_2< ||y^{***}-z^{***}||_2 $$

ii. kNN using Manhattan distance

(a) translate the data

Answer:

Similar to the answer provided in part (i.a) above, translating the data should not change the accuracy of kNN using Manhattan distance. if we consider three 2D points $x=(0,1), y=(1,2)$, and $z=(2,1)$ with features x and y, then we can compute the Manhattan distances between the points to see how they change before and after translation. If we translate these 3 points by adding a constant=1 to one of the features (let's choose feature X), then we would get the transformed versions $x_1=(1,1), y_1=(2,2)$, and $z_1=(3,1)$. The Euclidean distance of the original and translated points, using $y$ and $y_1$ as the pivot point, would be

$$||x-y||_1 = |0-1|+ |1-2|= 2$$$$||y-z||_1 = |1-2|+|2-1| = 2$$$$||x_1-y_1||_1 = |1-2| + |1-2|= 2$$$$||y_1-z_1||_1 = |2-3| +|2-2| = 1$$

Thus, the distances does change after translation since $||x-y||_1 = ||y-z||_1$ becomes $||x_1-y_1||_1 > ||y_1-z_1||_1$

If we add a constant=1 to both features X and Y, we get the new translated points as $x_2=(1,2), y_2=(2,3)$, and $z_2=(3,2)$, then the Euclidean distances between these 3 points are

$$||x_2-y_2||_1 = |1-2| + |3-2|= 2$$$$||y_2-z_2||_1 = |3-2| +|2-3| = 2$$

Thus, the distance ordering does not change after translation since $||x-y||_1 = ||y-z||_1$ remains $||x_1-y_1||_1 = ||y_1-z_1||_1$

Based on these, we can say that translating the data can change the accuracy of kNN using Manhattan distance depending on how we are translating the data.

(b) scale the data (i.e., multiply the all the points by a constant)

Answer:

The accuracy of a kNN classifier using the Manhattan distance does not change if we scale the data, as the relative distances between the data points remain the same. The Manhattan distance is calculated as the sum of absolute differences between the coordinates of two data points. Scaling the data will change the absolute differences between the coordinates, but it will not change their relative differences. Therefore, if we use the Manhattan distance as the distance metric for the kNN classifier, the relative distances between any two points will still be the same after scaling.

For example, consider two data points A and B with coordinates (2, 3) and (4, 5), respectively. The Manhattan distance between these two points can be calculated as follows:

$$distance \ = \ |4 - 2| + |5 - 3| = 4$$

Now, suppose we scale the data points by a factor of 2, such that the new coordinates become (4, 6) and (8, 10), respectively. The Manhattan distance between the two points will still be the same:

$$distance \ = \ |8 - 4| + |10 - 6| = 2*4 = 8$$

(c) rotate the data

Answer:

Similar to the reason provided in part (i.c), the accuracy of kNN classifier using Manhattan distance can change if the data is rotated. However, this change should depend on the degree of rotation and the data itself.

If we consider 3 points $x = (0, 1), y = (1, 0)$, and $z = (1, -2)$, then the Manhattan distance, using $y$ as the pivot point, between the 3 points would be

$$||x-y||_1= ||y-z||_1 = 2$$$$|0-1| + |1-0|= |1-1| +|0-2|= 2$$

which means that $x$ and $z$ are the same distance from $y$. If we consider a 45-degree rotation matrix $ A = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{-1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\\ \end{bmatrix} \quad $

Then, the new co-ordinates after the 45 degrees rotation would be given by, $ \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{-1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\\ \end{bmatrix} $ $ \begin{bmatrix} x \\ y\\ \end{bmatrix} $

Hence, x, y and z get transformed to $x*$, $y*$ and $z*$ as follows,

$$x* = Ax = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{-1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\\ \end{bmatrix} \begin{bmatrix} 0 \\ 1\\ \end{bmatrix} = \begin{bmatrix} \frac{-1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}}\\ \end{bmatrix}$$$$y* = Ay = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{-1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\\ \end{bmatrix} \begin{bmatrix} 1 \\ 0\\ \end{bmatrix} = \begin{bmatrix} \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \\ \end{bmatrix}$$$$z* = Az = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{-1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\\ \end{bmatrix} \begin{bmatrix} 1 \\ -2\\ \end{bmatrix} = \begin{bmatrix} \frac{3}{\sqrt{2}} \\ \frac{-1}{\sqrt{2}}\\ \end{bmatrix}$$

Thus, the new Manhattan distance after a 45-degree rotation becomes $$||x*-y*||_1 = \Bigg|\frac{-1}{\sqrt{2}} - \frac{1}{\sqrt{2}}\Bigg|+\Bigg|\frac{1}{\sqrt{2}} - \frac{1}{\sqrt{2}} \Bigg| = \sqrt{2}$$

$$||y*-z*||_1 =\Bigg|\frac{1}{\sqrt{2}} - \frac{3}{\sqrt{2}}\Bigg|+\Bigg|\frac{1}{\sqrt{2}} - \frac{-1}{\sqrt{2}} \Bigg| =\frac{4}{\sqrt{2}}$$

Hence, ordering is not preserved with Manhatan distance after rotation since we get $$||x*-y*||_1< ||y*-z*||_1 $$

Part 7. kNN implementation from scratch¶

Implement kNN in Matlab or Python for handwritten digit classification and submit all codes and plots:

(a) Download MNIST digit dataset (60,000 training and 10,000 testing data points) and the starter code from the course page. Each row in the matrix represents a handwritten digit image. The starter code shows how to visualize an example data point in Matlab. The task is to predict the class (0 to 9) for a given test image, so it is a 10-way classification problem.

In [2]:
# Import dependencies 
import pandas as pd 
import nbconvert
import numpy as np
from scipy.io import loadmat
import matplotlib.pyplot as plt
import operator 
from operator import itemgetter
import plotly.express as px
import timeit
In [3]:
#Loading the data
M = loadmat('data\MNIST_digit_data.mat')
images_train,images_test,labels_train,labels_test= M['images_train'],M['images_test'],M['labels_train'],M['labels_test']
In [4]:
#just to make all random sequences on all computers the same.
np.random.seed(1)
In [5]:
#randomly permute data points
inds = np.random.permutation(images_train.shape[0])
images_train = images_train[inds]
labels_train = labels_train[inds]
In [5]:
inds = np.random.permutation(images_test.shape[0])
images_test = images_test[inds]
labels_test = labels_test[inds]
In [6]:
#if you want to use only the first 1000 data points.
images_train = images_train[0:1000,:]
labels_train = labels_train[0:1000,:]
In [7]:
#show the 10'th train image
i=10
im = images_train[i,:].reshape((28,28),order='F')
plt.imshow(im)
plt.title('Class Label:'+str(labels_train[i][0]))
plt.show()
In [8]:
print(images_train.shape, labels_train.shape)
print(images_test.shape, labels_test.shape)
(1000, 784) (1000, 1)
(10000, 784) (10000, 1)

(b) Write a Matlab or Python function that implements kNN for this task and reports the accuracy for each class (10 numbers) as well as the average accuracy (one number).

[acc acc_av] = kNN(images train, labels train, images test, labels test, k)

where acc is a vector of length 10 and acc_av is a scalar. Look at a few correct and wrong predictions to see if it makes sense. To speed it up, in all experiments, you may use only the first 1000 testing images.

In [13]:
# Class ‘KNN’ with initialized instances 

class KNN:
    def __init__(self, k):
        self.k = k
        
    def fit_kNN(self, images_train, labels_train):
        self.images_train = images_train
        self.labels_train = labels_train
        
    def l2_distance(self, x1, x2):
        return np.sqrt(np.sum((x1-x2)**2))
    
    def predict_kNN(self, images_test):
        pred_knn = [] # a list to store predictions
        
        #iterate through test data
        for i in range(len(images_test)):
            
            #calculate distances between test and training data 
            d = np.array([self.l2_distance(images_test[i], x_t) for x_t in self.images_train])
            
            #sort distances and only keep k neighbors 
            sorted_d = d.argsort()[:self.k]
            
            #initialize dict to count class occurrences in labels_train
            cnt_neighbors = {}
            
            #iterate over neighbors 
            for j in sorted_d:
                if self.labels_train[j][0] in cnt_neighbors:
                    cnt_neighbors[self.labels_train[j][0]] += 1
                else:
                    cnt_neighbors[self.labels_train[j][0]] = 1
                    
            #sort occurence values from most to least
            sorted_neigh_cnt = sorted(cnt_neighbors.items(), key=operator.itemgetter(1), reverse=True)
            pred_knn.append(sorted_neigh_cnt[0][0]) 
        return pred_knn
            
    def score_kNN(self, images_test, labels_test):
        pred = self.predict_kNN(images_test)
        #convert labels_test as nested list to flat array list 
        flat_labels_test = np.array([ item for elem in labels_test for item in elem])
        return (pred == flat_labels_test).sum() / len(labels_test)
    
    def class_acc(self, images_test, labels_test):
        pred = self.predict_kNN(images_test)
        #convert labels_test as nested list to flat array list 
        flat_labels_test = np.array([ item for elem in labels_test for item in elem])
        cm = pd.crosstab(flat_labels_test,pred)
        #print(cm)
        class_cm = np.array((cm.max(axis=1)/cm.sum(axis=1)))
        # acc_av = sum(np.diag(cm)) / len(labels_test)
        acc_av = sum(class_cm) / len(class_cm) #yields acc similar to score_kNN's result in function above 
        acc = [[class_cm.tolist()], acc_av]
        return acc #classwise accuracy = number of correct predictions / total predictions for a class 
In [14]:
%%time
#Test class 
model = KNN(k=3)
model.fit_kNN(images_train, labels_train)
pred= model.predict_kNN(images_test)
#model.score_kNN(images_test, labels_test)
#return accuracy for each class (10 numbers) + average accuracy
acc= model.class_acc(images_test, labels_test)
acc
CPU times: total: 4min 55s
Wall time: 5min 7s
Out[14]:
[[[0.9683673469387755,
   0.9947136563876652,
   0.8517441860465116,
   0.8702970297029703,
   0.8462321792260692,
   0.8295964125560538,
   0.9613778705636743,
   0.8949416342412452,
   0.8049281314168378,
   0.8909811694747275]],
 0.8913179616554532]
In [15]:
#average accuracy
model.score_kNN(images_test, labels_test)
Out[15]:
0.8931
In [16]:
def l2_dist( x1, x2):
        return np.sqrt(np.sum((x1-x2)**2))
    
In [17]:
# Function that implements kNN to predict the
# class (0 to 9) for a given test image, so it is a 10-way classification problem
# and reports the accuracy for each class (10 numbers) as well as the average accuracy (one number).

def KNN(images_train, labels_train, images_test, labels_test, k):
    pred_knn = [] # a list to store predictions
    
    #iterate through test data
    for i in range(len(images_test)):

        ##calculate distances between test and training data 
        d = np.array([l2_dist(images_test[i], x_t) for x_t in images_train])

        #sort distances and only keep k neighbors 
        sorted_d = d.argsort()[:k]

        #initialize dict to count class occurrences in labels_train
        cnt_neighbors = {}

        #iterate over neighbors 
        for j in sorted_d:
            if labels_train[j][0] in cnt_neighbors:
                cnt_neighbors[labels_train[j][0]] += 1
            else:
                cnt_neighbors[labels_train[j][0]] = 1

        #sort occurence values from most to least
        sorted_neigh_cnt = sorted(cnt_neighbors.items(), key=operator.itemgetter(1), reverse=True)
        pred_knn.append(sorted_neigh_cnt[0][0]) 
    
    ##classwise accuracy and average accuracy 
    #convert labels_test as nested list to flat array list 
    flat_labels_test = [ item for elem in labels_test for item in elem]
    cm = pd.crosstab(flat_labels_test,pred_knn)
    #print(cm)
    class_acc = np.array((cm.max(axis=1)/cm.sum(axis=1)))
    #overall_acc = sum(pred_knn == np.array(flat_labels_test)) / len(labels_test)
    # acc_av = sum(np.diag(cm)) / len(labels_test)
    acc_av = sum(class_acc) / len(class_acc) #yields acc similar to overall_acc  
    acc = [class_acc.tolist(), acc_av]
    return acc #classwise accuracy = number of correct predictions / total predictions for a class 
In [18]:
%%time
#Test function 
KNN(images_train, labels_train, images_test, labels_test, 3)
CPU times: total: 2min 26s
Wall time: 2min 33s
Out[18]:
[[0.9683673469387755,
  0.9947136563876652,
  0.8517441860465116,
  0.8702970297029703,
  0.8462321792260692,
  0.8295964125560538,
  0.9613778705636743,
  0.8949416342412452,
  0.8049281314168378,
  0.8909811694747275],
 0.8913179616554532]

(c) For k = 1, change the number of training data points (30 to 10,000) to see the change in performance. Plot the average accuracy for 10 different dataset sizes. You may use command logspace in Matlab. In the plot, x-axis is for the number of training data and y-axis is for the accuracy.

In [19]:
img_train,img_test,lbl_train,lbl_test= M['images_train'],M['images_test'],M['labels_train'],M['labels_test']
In [20]:
# Change the number of training data points (30 to 10,000)

#set 10 different sizes 
size = np.logspace(np.log10(30) , np.log10(10000) , num=10)
size = np.around(size)
size = size.astype(int)
size

#initialize empty dictionary for different samples 
img_dict = {} 
lbl_dict = {} 

#loop to get different samples into samples_dict
for s in range(len(size)): 
    inds = np.random.permutation(img_train.shape[0]) #get random indices for each size 
    X = img_train[inds]
    y = lbl_train[inds]
    #n = sizes[s]
    img_dict["img_train{0}".format(size[s])] = X[0:size[s],:]
    lbl_dict["lbl_train{0}".format(size[s])] = y[0:size[s],:]
In [21]:
#print shaped of each sample 
for key in img_dict: 
    print(img_dict[key].shape)
(30, 784)
(57, 784)
(109, 784)
(208, 784)
(397, 784)
(756, 784)
(1442, 784)
(2750, 784)
(5244, 784)
(10000, 784)
In [22]:
#print shaped of each sample 
for key in lbl_dict: 
    print(lbl_dict[key].shape)
(30, 1)
(57, 1)
(109, 1)
(208, 1)
(397, 1)
(756, 1)
(1442, 1)
(2750, 1)
(5244, 1)
(10000, 1)
In [23]:
%%time
# Fit k=1 for different training sizes 
acc_av_dict={} # initialize an empty dictionary for average accuracies 
for s in range(len(size)):
    avg_acc = KNN(img_dict["img_train{}".format(size[s])], lbl_dict["lbl_train{}".format(size[s])] , images_test, labels_test, 1)
    acc_av_dict[size[s]] = avg_acc[1]
    print("size {} done".format(size[s]))
size 30 done
size 57 done
size 109 done
size 208 done
size 397 done
size 756 done
size 1442 done
size 2750 done
size 5244 done
size 10000 done
CPU times: total: 52min 19s
Wall time: 55min 8s
In [24]:
#convert dict to df 
avg_accdf = pd.DataFrame.from_dict(acc_av_dict.items())
avg_accdf.columns = ['Number of training data', 'Accuracy']
avg_accdf
Out[24]:
Number of training data Accuracy
0 30 0.561468
1 57 0.615278
2 109 0.716249
3 208 0.802582
4 397 0.839968
5 756 0.871002
6 1442 0.896835
7 2750 0.921617
8 5244 0.932798
9 10000 0.949314
In [41]:
# Plot the average accuracy for 10 different dataset sizes
# List arguments
fig = px.line(x=avg_accdf['Number of training data'], y=avg_accdf['Accuracy'],title="Accuracy of kNN using k=1 for different training sample sizes")
fig.update_layout(template = 'plotly_dark', 
    xaxis_title="training sample size",
    yaxis_title="kNN accuracy",
    )

fig.show(renderer='notebook')

Observation(s):

The average accuracy is logarithmic. As the sample size increases, the average accuracy coverges to 1.

(d) Show the effect of k on the accuracy. Make a plot similar to the above one with multiple colored curves on the top of each other (each for a particular k in [1 2 3 5 10].) You may use command legend in Matlab to name different colors.

In [26]:
%%time
k_vals = [1, 2, 3, 5, 10] #set diff values of k 
acc_av_dict_per_k= {}

for k in range(len(k_vals)): 
    for s in range(len(size)): 
        avg_acc = KNN(img_dict["img_train{}".format(size[s])], lbl_dict["lbl_train{}".format(size[s])] , images_test, labels_test, k_vals[k])
        acc_av_dict_per_k['{}_{}'.format(k_vals[k], size[s])]= avg_acc[1]
        print('{}_{} done'.format(k_vals[k], size[s]))
1_30 done
1_57 done
1_109 done
1_208 done
1_397 done
1_756 done
1_1442 done
1_2750 done
1_5244 done
1_10000 done
2_30 done
2_57 done
2_109 done
2_208 done
2_397 done
2_756 done
2_1442 done
2_2750 done
2_5244 done
2_10000 done
3_30 done
3_57 done
3_109 done
3_208 done
3_397 done
3_756 done
3_1442 done
3_2750 done
3_5244 done
3_10000 done
5_30 done
5_57 done
5_109 done
5_208 done
5_397 done
5_756 done
5_1442 done
5_2750 done
5_5244 done
5_10000 done
10_30 done
10_57 done
10_109 done
10_208 done
10_397 done
10_756 done
10_1442 done
10_2750 done
10_5244 done
10_10000 done
CPU times: total: 4h 16min 49s
Wall time: 6h 20min 40s
In [34]:
#convert dict to df 
avg_acc_kdf = pd.DataFrame.from_dict(acc_av_dict_per_k.items())
avg_acc_kdf.columns = ['ID','acc']
avg_acc_kdf['ID'] = avg_acc_kdf['ID'].astype('string')
avg_acc_kdf[['k','size']] = avg_acc_kdf.ID.str.split('_', expand=True)
avg_acc_kdf.dtypes
avg_acc_kdf
Out[34]:
ID acc k size
0 1_30 0.561468 1 30
1 1_57 0.615278 1 57
2 1_109 0.716249 1 109
3 1_208 0.802582 1 208
4 1_397 0.839968 1 397
5 1_756 0.871002 1 756
6 1_1442 0.896835 1 1442
7 1_2750 0.921617 1 2750
8 1_5244 0.932798 1 5244
9 1_10000 0.949314 1 10000
10 2_30 0.561468 2 30
11 2_57 0.615278 2 57
12 2_109 0.716249 2 109
13 2_208 0.802582 2 208
14 2_397 0.839968 2 397
15 2_756 0.871002 2 756
16 2_1442 0.896835 2 1442
17 2_2750 0.921617 2 2750
18 2_5244 0.932798 2 5244
19 2_10000 0.949314 2 10000
20 3_30 0.551251 3 30
21 3_57 0.615399 3 57
22 3_109 0.725491 3 109
23 3_208 0.791067 3 208
24 3_397 0.833035 3 397
25 3_756 0.878071 3 756
26 3_1442 0.901486 3 1442
27 3_2750 0.924864 3 2750
28 3_5244 0.939595 3 5244
29 3_10000 0.951342 3 10000
30 5_30 0.589869 5 30
31 5_57 0.606772 5 57
32 5_109 0.720239 5 109
33 5_208 0.777778 5 208
34 5_397 0.823589 5 397
35 5_756 0.876382 5 756
36 5_1442 0.898421 5 1442
37 5_2750 0.925827 5 2750
38 5_5244 0.938605 5 5244
39 5_10000 0.950336 5 10000
40 10_30 0.639373 10 30
41 10_57 0.602814 10 57
42 10_109 0.665238 10 109
43 10_208 0.745905 10 208
44 10_397 0.794459 10 397
45 10_756 0.858533 10 756
46 10_1442 0.886947 10 1442
47 10_2750 0.917808 10 2750
48 10_5244 0.932122 10 5244
49 10_10000 0.946892 10 10000
In [35]:
#Plot 
fig = px.line(x=avg_acc_kdf['size'], y=avg_acc_kdf['acc'],
              color = avg_acc_kdf['k'],
              title="Accuracy of kNN for different k and training sample sizes")
fig.update_layout(template = 'plotly_dark', 
    xaxis_title="training sample size",
    yaxis_title="kNN accuracy",
    legend_title_text='k'
    )

fig.show(renderer='notebook')

Observation(s):

It appears that at sample size 30, k=10 has the highest accuracy but it get the lowest accuracy as the sample size increases. The values k=3,5 have the highest accuracy at the highest sample size. There is also an overall decrease in accuracy in different values of k at sample size 57.

(e) Choose the best k for 2,000 total training data by splitting the training data into two halves (the first for training and the second for validation). You may plot the average accuracy wrt k for this. Note that in this part, you should not use the test data. You may search for k in this list: [1 2 3 5 10].

In [36]:
# For the last part only, please choose 2000 training data randomly and use those only

#split training data into training and validation 
#choose any 2k images out of 60k and split them in half
inds = np.random.permutation(images_train.shape[0])
images_train = images_train[inds]
labels_train = labels_train[inds]
images_train = images_train[0:2000,:]
labels_train = labels_train[0:2000,:]

#training images and labels 
i =np.random.permutation(images_train.shape[0])
images_train = images_train[i]
labels_train = labels_train[i]
img_tr = images_train[0:1000,:]
lbl_tr = labels_train[0:1000,:]
i =np.random.permutation(images_train.shape[0])
images_train = images_train[i]
labels_train = labels_train[i]
img_vl = images_train[0:1000,:]
lbl_vl = labels_train[0:1000,:]
In [37]:
%%time
#use training data to find nearest neighbors 
#use validation data to find best k 
cv_dict= {}
folds = 2 
img_training = img_tr
lbl_training = lbl_tr
img_validation = img_vl
lbl_validation = lbl_vl

for f in range(folds):
    print('fold {} running'.format(f))
    for k in range(len(k_vals)): 
            acc = KNN(img_training, lbl_training, img_validation, lbl_validation, k_vals[k])
            cv_dict['{}_{}'.format(f, k_vals[k])]=acc[1]
            print('k = {} done'.format(k_vals[k]))
    img_training = img_vl
    lbl_training = lbl_vl
    img_validation = img_tr
    lbl_validation = lbl_tr
fold 0 running
k = 1 done
k = 2 done
k = 3 done
k = 5 done
k = 10 done
fold 1 running
k = 1 done
k = 2 done
k = 3 done
k = 5 done
k = 10 done
CPU times: total: 2min 5s
Wall time: 2min 8s
In [38]:
#convert dict to df 
cv_accdf = pd.DataFrame.from_dict(cv_dict.items())
cv_accdf.columns = ['id', 'Accuracy']
cv_accdf['id'] = cv_accdf['id'].astype('string')
cv_accdf[['fold','k']] = cv_accdf.id.str.split('_', expand=True)
cv_accdf
Out[38]:
id Accuracy fold k
0 0_1 1.000000 0 1
1 0_2 1.000000 0 2
2 0_3 0.970020 0 3
3 0_5 0.929642 0 5
4 0_10 0.895525 0 10
5 1_1 1.000000 1 1
6 1_2 1.000000 1 2
7 1_3 0.970020 1 3
8 1_5 0.929642 1 5
9 1_10 0.895525 1 10
In [50]:
#average accuracies per k 
cv_avgaccdf = cv_accdf.groupby('k', as_index=False)['Accuracy'].mean()
cv_avgaccdf.sort_values(by=['Accuracy'], ascending=False, inplace=True)
cv_avgaccdf
Out[50]:
k Accuracy
0 1 1.000000
2 2 1.000000
3 3 0.970020
4 5 0.929642
1 10 0.895525
In [51]:
#Plot 
fig = px.line(x=cv_avgaccdf['k'], y=cv_avgaccdf['Accuracy'],
              title="Accuracy of 2-fold cross-validate kNN using different k values")
fig.update_layout(template = 'plotly_dark', 
    xaxis_title="k",
    yaxis_title="kNN accuracy",
    )

fig.show(renderer='notebook')

Observation(s):

We can see that at lower values of k (k=1,2) , the KNN tends to closely follow the training data and thus shows a perfectly high training score. This is overfitted so the optimal k would be k=3,4.